550 Lab 3

Submission Instructions

Complete the following questions using either ggplot in R or Altair in Python. Please note that at least one assignment must be submitted using R and at least one assignment must be submitted using Altair. That is, if you have submitted the first two assignments in Python, this lab must be completed using R. If you have submitted the first two assignments in R, this lab must be completed using Python. If you have submitted one assignment in R and one assignment in Python, then you can complete this assignment in whichever language you want. Submit your completed assignment to Canvas, that is:

  • rendered .html file(s) of your assignment and
  • the source file(s) (.Rmd and/or .ipynb)
In [2]:
# Run this cell to ensure that altair plots show up in the exported HTML
# and that the R cell magic works
import altair as alt

# Save a vega-lite spec and a PNG blob for each plot in the notebook
alt.renderers.enable('default')
Out[2]:
RendererRegistry.enable('default')

Redesigning plots for Communication

Throughout this module we have discussed how to (and how not to) visualize data. In this section you will see a series of visualizations and claims that you will need to improve and debunk. You can find the corresponding datasets in the labs/data/ folder of the 550 student GitHub repo. For each of the following questions in this section, you are presented with the following three tasks:

  1. Identify what aspects of the plot are deficient and how the claim is misleading.
  2. Recreate the plot as close as possible to the images provided in the question. This includes axis labels, legends, colors, titles, figure size, etc.
  3. Create your own better version of the same figure. Make sure that your choices are motivated by what you wrote in point 1.

1. Job satisfaction

Data

The job-satisfaction.csv dataset contains (fictional) data from different cities around the world. This data set measures the proportion of working people who reported that they were unsatisfied with their work situation. Each row corresponds to a different city, and each column corresponds to a year.

My plots

Altair

ggplot

[Image: ggplot version of the plot]

My claim

The majority of cities have seen a rise in the proportion of dissatisfied working people. I have used transparency for the dots to prevent overplotting/oversaturation, so this means that the entire dark area has an even number of cities throughout. In other words, there are as many cities that increased from ~0% to ~40% dissatisfaction as there are cities that stayed around 0%.

Question 1.1

rubric={reasoning:2,writing:1}

  • Explain which mistakes (if any) I have made in my plot and provide an explanation as to why you think so.
  • Provide a more accurate interpretation of the plot. Be brief and clear in your answer. While you might have some ideas just from looking at the plot, you will likely need to do some investigation yourself to answer this question fully. If appropriate, you can reference your improved plot from the following question.
   

YOUR ANSWER HERE

The plot is taller than it is wide, so the x-axis is compressed and the points pile up in the bottom-left corner, which gives the impression of an upward trend. If the width and height are made equal, the trend is much less pronounced; for cities above the 0.5 level, dissatisfaction is actually lower this year than last year. The original conclusion is therefore inaccurate, and the improved plot gives a more realistic picture. In addition, the improved histogram of the differences between this year and last year shows that most differences are very close to 0 or negative, which indicates that dissatisfaction has stayed about the same or come down slightly.

The axis titles should also be cleaned up so that they describe what is being measured, and the subtitle should use more appropriate language.
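One way to check the claim numerically, rather than by eye, is to count how many cities went up, went down, or stayed the same. The sketch below uses plain Python and made-up (last_year, this_year) pairs; the helper name `summarize_changes` and the numbers are illustrative assumptions, not values from job-satisfaction.csv.

```python
def summarize_changes(pairs):
    """Count cities whose dissatisfaction rose, fell, or stayed the same."""
    counts = {"increased": 0, "decreased": 0, "unchanged": 0}
    for last_year, this_year in pairs:
        if this_year > last_year:
            counts["increased"] += 1
        elif this_year < last_year:
            counts["decreased"] += 1
        else:
            counts["unchanged"] += 1
    return counts


# Hypothetical (last_year, this_year) proportions, one pair per city
example_cities = [(0.02, 0.01), (0.05, 0.05), (0.10, 0.40), (0.60, 0.45), (0.03, 0.02)]

change_summary = summarize_changes(example_cities)
print(change_summary)
```

With the real data, the same counts could be obtained from the sign of `this_year - last_year`; a single large jump is very visible in a scatter plot but counts as only one city here.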

Question 1.2

rubric={accuracy:2}

Recreate this plot using Altair or ggplot. Your plot should look as close as possible to mine in the image above.

In [3]:
import pandas as pd
In [100]:
# YOUR ANSWER HERE
df = pd.read_csv('job-satisfaction.csv')
alt.Chart(df
).mark_circle(opacity=0.6, size=100).encode(
    x=alt.X('last_year', axis=alt.Axis(titleFontSize=16)),
    y=alt.Y('this_year', axis=alt.Axis(titleFontSize=16))).properties(
    title=alt.TitleParams(
        text='Dramatic increase in dissatisfaction at work',
        subtitle="Can't get no job satisfaction",
        anchor='start',
        fontSize=30 
    ),
    height=600,
    width=400
)
Out[100]:

Question 1.3

rubric={viz:2}

Create a new and improved visualization to aid in understanding this data. This plot should combat the issues you described in 1.1 and attempt to debunk my claim. Design this visualization to be an effective plot for communication purposes, this includes proper figure size, axis/figure titles, font sizes, colors, etc. Note that there is no "right" answer here.

In [101]:
# YOUR ANSWER HERE
df = pd.read_csv('job-satisfaction.csv')
df['diff'] = df['this_year'] - df['last_year']
a = alt.Chart(df
).mark_bar().encode(
    x=alt.X('diff', title='Difference in Dissatisfaction between This Year and Last Year', axis=alt.Axis(titleFontSize=15), bin=alt.Bin(maxbins=20)),
    y=alt.Y('count()', title='Frequency', axis=alt.Axis(titleFontSize=16))).properties(
    title=alt.TitleParams(
        text='Dissatisfaction comes down a little',
        subtitle="A possible cause is thought to be the proposed new 4-day work week",
        fontSize=30 
    ),
    height=400,
    width=400
)

b = alt.Chart(df
).mark_circle(opacity=0.6, size=100).encode(
    x=alt.X('last_year', title='Last Year', axis=alt.Axis(titleFontSize=16)),
    y=alt.Y('this_year', title='This Year', axis=alt.Axis(titleFontSize=16))).properties(
    title=alt.TitleParams(
        text='Dissatisfaction comes down a little',
        subtitle="A possible cause is thought to be the proposed new 4-day work week",
        fontSize=30 
    ),
    height=600,
    width=600
)

b & a
Out[101]:

2. Rates

The data

The rates.csv dataset contains (fictional) data surrounding the wages of private tutors. Each datapoint represents the average rate in a particular district before COVID-19 forced teachers to go online and after.

My plots

Altair

ggplot

[Images: ggplot versions of the plots]

My claim

There has been a significant increase in the average rate for private tutors. This implies that all (or almost all) private tutors are receiving a higher payment than what they were before COVID-19. This finding is even true when we look at the robust estimates or the median increase! Notably, the groups are balanced in that they have the same sample size.

Question 2.1

rubric={reasoning:2,writing:1}

  • Explain which mistakes (if any) I have made in my plot and provide an explanation as to why you think so.
  • Provide a more accurate interpretation of the plot. Be brief and clear in your answer. While you might have some ideas just from looking at the plot, you will likely need to do some investigation yourself to answer this question fully. If appropriate, you can reference your improved plot from the following question.

YOUR ANSWER HERE

In the mean and median plots, the x-axis does not start at 0, which makes the difference between before and after look larger than it is. Starting both axes at 0 shows that the average rate for private tutors did increase, but not as much as the original plots suggest.

The y-axis title does not need to say "when" on each plot because the before/after labels already make this clear, so it can be removed. The figure should also have a descriptive title.
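The effect of a truncated axis can be put into numbers: the drawn length of a bar is its value minus the axis start, so the ratio of the two drawn bars depends on where the axis begins. The sketch below is a worked example with hypothetical means of 20 and 22 (not values computed from rates.csv); `apparent_ratio` is an illustrative helper name.

```python
def apparent_ratio(before, after, axis_start=0.0):
    """Ratio of the drawn bar lengths when the axis begins at axis_start."""
    if axis_start >= min(before, after):
        raise ValueError("axis_start must be below both values")
    return (after - axis_start) / (before - axis_start)


# Hypothetical mean rates, chosen only for illustration
before_mean, after_mean = 20.0, 22.0

truncated_ratio = apparent_ratio(before_mean, after_mean, axis_start=16.0)
zero_based_ratio = apparent_ratio(before_mean, after_mean)

print(f"axis from 16: 'after' bar is {truncated_ratio:.2f}x the 'before' bar")
print(f"axis from 0:  'after' bar is {zero_based_ratio:.2f}x the 'before' bar")
```

Under these assumed values, a 10% increase is drawn as a bar that is 50% longer when the axis starts at 16.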

Question 2.2

rubric={accuracy:2}

Recreate this plot using Altair or ggplot. Your plot should look as close as possible to mine in the image above.

In [4]:
# YOUR ANSWER HERE
df = pd.read_csv('rates.csv')
# Mean rate per period; the x-axis is truncated to match the original plot
mean = alt.Chart(df).mark_bar().encode(
    x=alt.X('mean(rates)', scale=alt.Scale(domain=[16, 24])),
    y=alt.Y('when', sort='x'),
    color=alt.Color('when', legend=None))

# Median rate per period, also on a truncated x-axis
median = alt.Chart(df).mark_bar().encode(
    x=alt.X('median(rates)', scale=alt.Scale(domain=[16, 32])),
    y=alt.Y('when', sort='x'),
    color=alt.Color('when', legend=None))

# Number of observations per period
count = alt.Chart(df).mark_bar().encode(
    x=alt.X('count(rates)'),
    y=alt.Y('when', scale=alt.Scale(domain=['before', 'after'])),
    color=alt.Color('when', legend=None)
)

# Label each count bar with its value
count = count + count.mark_text(align='left', baseline='middle').encode(text='count(rates):Q')

((mean | median) & count)
Out[4]:

Question 2.3

rubric={viz:2}

Create a new and improved visualization to aid in understanding this data. This plot should combat the issues you described in 2.1 and attempt to debunk my claim. Design this visualization to be an effective plot for communication purposes, this includes proper figure size, axis/figure titles, font sizes, colors, etc. Note that there is no "right" answer here.

In [9]:
# YOUR ANSWER HERE
df = pd.read_csv('rates.csv')
# Mean rate per period, with the x-axis starting at 0
mean = alt.Chart(df).mark_bar().encode(
    x=alt.X('mean(rates)'),
    y=alt.Y('when', sort='x', title=None),
    color=alt.Color('when', legend=None))

# Median rate per period, with the x-axis starting at 0
median = alt.Chart(df).mark_bar().encode(
    x=alt.X('median(rates)'),
    y=alt.Y('when', sort='x', title=None),
    color=alt.Color('when', legend=None))

# Number of observations per period
count = alt.Chart(df).mark_bar().encode(
    x=alt.X('count(rates)'),
    y=alt.Y('when', title=None, scale=alt.Scale(domain=['before', 'after'])),
    color=alt.Color('when', legend=None)
)

# Label each count bar with its value
count = count + count.mark_text(align='left', baseline='middle').encode(text='count(rates):Q')

((mean | median) & count).properties(
    title=alt.TitleParams(
        text='Private tutor rates increase slightly after Covid-19',
        fontSize=20, anchor='middle')
    )
Out[9]:

3. Gun Violence

The data

The guns-violence.csv dataset contains yearly data for the number of murders by firearms in Florida. This is based on real data, but I have drawn out these dots by hand so they are not exact. The original chart (shown below) has attracted a lot of negative attention, for reasons which you will discover upon completing this question.

[Image: the original chart]

My plots

Altair

[Image: Altair version of the plot]

ggplot

[Image: ggplot version of the plot]

My claim

Just after Florida enacted its "stand your ground" self-defense law in 2005, the deaths from firearms plummeted.

Question 3.1

rubric={reasoning:2,writing:1}

  • Explain which mistakes (if any) I have made in my plot and provide an explanation as to why you think so.
  • Provide a more accurate interpretation of the plot. Be brief and clear in your answer. While you might have some ideas just from looking at the plot, you will likely need to do some investigation yourself to answer this question fully. If appropriate, you can reference your improved plot from the following question.
    

YOUR ANSWER HERE

Because the y-axis is inverted, it is difficult to tell whether deaths are increasing or decreasing. Saying that deaths "plummeted" alongside this plot leads the reader to believe that deaths went down significantly, which is also how the plot appears, since the curve drops after 2005/6.

A more accurate statement is that deaths from firearms went up after Florida enacted the "stand your ground" self-defense law in 2005. The y-axis should start at 0 at the bottom and increase upward to the maximum value. In the improved plot it is easy to see that deaths rose after 2005/6, in line with this interpretation.

The plot should also have a proper title. The years on the x-axis should not have a thousands separator; they should be formatted as years.
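The effect of the inverted axis can be illustrated with a small sketch of how a linear y scale maps a value to a vertical position. The function `y_pixel`, the plot height, and the two murder counts are hypothetical and only illustrate the mechanism; they are not taken from guns-violence.csv or from how any plotting library is implemented.

```python
def y_pixel(value, domain_max, height, reverse=False):
    """Distance in pixels from the TOP of the plot for a linear y scale on [0, domain_max]."""
    fraction = value / domain_max
    # Conventional axis: larger values are drawn closer to the top.
    # Reversed axis: larger values are drawn closer to the bottom.
    return fraction * height if reverse else (1 - fraction) * height


# Hypothetical counts before and after the law, on a 400-pixel-tall plot
earlier, later = 500, 800

for reverse in (False, True):
    start = y_pixel(earlier, domain_max=1000, height=400, reverse=reverse)
    end = y_pixel(later, domain_max=1000, height=400, reverse=reverse)
    direction = "down" if end > start else "up"
    print(f"reverse={reverse}: the line moves {direction} on screen")
```

In both cases the count rises from 500 to 800; only the reversed scale draws that rise as a line heading downward, which is what makes "plummeted" look plausible.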

Question 3.2

rubric={accuracy:2}

Recreate this plot using Altair or ggplot. Your plot should look as close as possible to mine in the image above.

In [10]:
# YOUR ANSWER HERE
df = pd.read_csv('guns-violence.csv', parse_dates=['Year'])
area = alt.Chart(df).mark_area(color='maroon').encode(
    x=alt.X('Year', axis=alt.Axis(titleFontSize=16)),
    y=alt.Y('Firearm murders', scale=alt.Scale(reverse=True), axis=alt.Axis(titleFontSize=16))
).properties(
    width = 500,
    height = 400
)

point = area.mark_point(color='black', fill='black', size=20)
line = area.mark_line(color='black', size=2)

t = df.query('Year == 2006').copy()
t['text'] = 'Gun law enacted'
text = alt.Chart(t).mark_text(dx=45, dy=-5, color='white').encode(
    x='Year',
    y='Firearm murders',
    text='text'
)
(area + point + line + text).configure_axisX(grid=False)
Out[10]:

Question 3.3

rubric={viz:2}

Create a new and improved visualization to aid in understanding this data. This plot should combat the issues you described in 3.1 and attempt to debunk my claim. Design this visualization to be an effective plot for communication purposes, this includes proper figure size, axis/figure titles, font sizes, colors, etc. Note that there is no "right" answer here.

In [370]:
# YOUR ANSWER HERE
df = pd.read_csv('guns-violence.csv', parse_dates=['Year'])
area = alt.Chart(df).mark_area(color='maroon').encode(
    x=alt.X('Year', axis=alt.Axis(titleFontSize=16)),
    y=alt.Y('Firearm murders', axis=alt.Axis(titleFontSize=16))
).properties(
    title=alt.TitleParams(
        text='Gun Deaths in Florida Rise after 2005',
        subtitle="Number of murders committed using firearms rise after Florida enacted the 2005 self-defense law",
        fontSize=25),
    width = 500,
    height = 400
)

point = area.mark_point(color='black', fill='black', size=20)
line = area.mark_line(color='black', size=2)

t = df.query('Year == 2006').copy()
t['text'] = 'Gun law enacted'
text = alt.Chart(t).mark_text(dx=45, dy=10, color='white').encode(
    x='Year',
    y='Firearm murders',
    text='text'
)
(area + point + line + text).configure_axisX(grid=False)
Out[370]:

4. Approval ratings

The data

The (fictional) state-comparison.csv dataset contains a sample of voters from "my state" (Arizona) and a "neighboring state" (Nevada). Voters in each state were asked to rate their approval of the Arizona state mayor on a scale from 0 - 10,000.

My plots

Altair

[Image: Altair version of the plot]

ggplot

[Image: ggplot version of the plot]

My claim

It looks like the Arizona state mayor's approval ratings are through the roof! The Arizona mayor is clearly doing a better job than the Nevada mayor next door. And as you can see we sampled enough people to get smooth curves, which means that this difference is both likely to be statistically significant and the magnitude is just too big to ignore.

Question 4.1

rubric={reasoning:2,writing:1}

  • Explain which mistakes (if any) I have made in my plot and provide an explanation as to why you think so.
  • Provide a more accurate interpretation of the plot. Be brief and clear in your answer. While you might have some ideas just from looking at the plot, you will likely need to do some investigation yourself to answer this question fully. If appropriate, you can reference your improved plot from the following question.
    

YOUR ANSWER HERE

The claim compares the rating of the Arizona mayor with that of the Nevada mayor. However, according to the description of the data, voters from Arizona and Nevada were both asked to rate the Arizona state mayor on a scale from 0 - 10,000; the rating of the Nevada mayor was never collected.

In addition, a density plot does not show counts or the number of people sampled, so the claim that a smooth curve means enough people were sampled is false. The improved plot, which also draws a tick for each data point, shows that not enough people were sampled.

The title is not appropriate; it should describe what the plot is trying to show. The axis labels should also be clearer: "the value" could be interpreted in many ways. The legend at the top sits too close to the title and blends in with it; placing it on the right makes it easier to reference.
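To see why a smooth density curve says nothing about sample size, here is a minimal sketch of a Gaussian kernel density estimate written in plain Python. The five ratings and the bandwidth of 800 are made up for illustration, and the kernel and bandwidth are assumptions rather than the settings that `transform_density` uses.

```python
import math


def gaussian_kde(xs, sample, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `sample` at each point in `xs`."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
        for x in xs
    ]


# Hypothetical approval ratings from only five voters
tiny_sample = [2000, 2500, 4000, 7000, 7500]

step = 50
grid = list(range(-2000, 12001, step))
density = gaussian_kde(grid, tiny_sample, bandwidth=800)

# The curve is defined and smooth at every grid point, and integrates to about 1
area = sum(density) * step
print(f"{len(grid)} smooth curve points from {len(tiny_sample)} observations, area = {area:.3f}")
```

The estimate is a sum of smooth bumps, one per observation, so it looks smooth whether there are five observations or five thousand. Plotting the raw observations as ticks is what reveals how few there are.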

Question 4.2

rubric={accuracy:2}

Recreate this plot using Altair or ggplot. Your plot should look as close as possible to mine in the image above.

In [223]:
# YOUR ANSWER HERE
df = pd.read_csv('state-comparison.csv')
density_plot = alt.Chart(df).transform_density(
    'the value',
    as_=['the value', 'Density'],
    groupby=['where']
).mark_area(
    opacity=0.5
).encode(
    x=alt.X('the value:Q', title='the Value', axis=alt.Axis(format='~s')),
    y=alt.Y('Density:Q', title='density', stack=False),
    color=alt.Color('where', scale=alt.Scale(domain=['My state','Neighboring states'], range=['skyblue', 'turquoise']))
).properties(
    title=alt.TitleParams(
        text='Ocean Waves', fontSize=20),
    width=600,
    height=500
).configure_axis(grid=False).configure_legend(
    title=None,
    orient='top')
density_plot
Out[223]:

Question 4.3

rubric={viz:2}

Create a new and improved visualization to aid in understanding this data. This plot should combat the issues you described in 4.1 and attempt to debunk my claim. Design this visualization to be an effective plot for communication purposes, this includes proper figure size, axis/figure titles, font sizes, colors, etc. Note that there is no "right" answer here.

In [13]:
# YOUR ANSWER HERE
df = pd.read_csv('state-comparison.csv')
density_plot = alt.Chart(df).transform_density(
    'the value',
    as_=['the value', 'Density'],
    groupby=['where']
).mark_area(
    opacity=0.5
).encode(
    x=alt.X('the value:Q', title='Voter Approval Rating', axis=alt.Axis(format='~s')),
    y=alt.Y('Density:Q', title='Density', stack=False),
    color=alt.Color('where', scale=alt.Scale(domain=['My state','Neighboring states'], range=['skyblue', 'turquoise']), title='State')
).properties(
    title=alt.TitleParams(
        text='Voters in Arizona like the mayor of Arizona more than voters in Nevada', fontSize=20),
    width=600,
    height=500
)

ticks = alt.Chart(df
).mark_tick(
  color='black', yOffset=240
).encode(x='the value')

(density_plot + ticks).configure_axis(grid=False)
Out[13]:

Trendlines

In this section you will look at trendlines and error bars.

Data

Let's take a look at the last 1000 days of weather data from the cities of Kelowna and Vancouver. You can download the data directly from https://kelowna.weatherstats.ca/download.html and https://vancouver.weatherstats.ca/download.html. The figures below are based on data collected in 2023 (which you can find as weatherstats_kelowna_vancouver.csv in the 550 GitHub repo; note that I have added a column called city to indicate which city the record came from). For your assignment, please retrieve the most recent 1000 days and base your visualizations on them.

My plot

Altair

ggplot

[Image: ggplot version of the plot]

My Claim

While the annual average temperatures in Canada have consistently been above or equal to the reference value from 1997 onward [1], the cities of Kelowna and Vancouver have seen a steady downward trend in average daily temperatures.

Question 5.1

rubric={reasoning:1}

  • Explain which mistakes (if any) I have made in my plot and provide an explanation as to why you think so.
      

YOUR ANSWER HERE

The plot shows that the data are cyclical, so the trend line can be increasing or decreasing depending on the start and end points chosen. Here the data start at a high and end at a low, which makes the trendline slope downward. However, the peaks and troughs are very similar from one cycle to the next.

A more appropriate way to analyze cyclical data is to compare year-over-year change. In the improved plots (either all together or broken up by year), I only considered points for which the year-over-year difference in average temperature can be calculated. These plots make it much clearer that the year-over-year change in average temperature is centered around 0, which indicates that the original claim of a downward trend is not accurate.

In addition, it is not correct to generalize a trend from two cities to the entirety of British Columbia. Both Kelowna and Vancouver are in the south of BC; cities farther north could have experienced a different trend.

The plot title should be changed so that it applies specifically to Vancouver and Kelowna. Axis labels should be more readable, and temperature should have a unit because it currently isn't clear whether it is in Celsius or Fahrenheit.
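The dependence of a linear trendline on the chosen window can be demonstrated with synthetic data. The sketch below fits an ordinary least squares slope to a purely seasonal cosine curve that has no long-term trend at all; the temperature model, the 548-day window, and the helper names are assumptions for illustration, not the weatherstats data.

```python
import math


def ols_slope(xs, ys):
    """Slope of the least squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def seasonal_temperature(day, phase_days=0.0):
    """Synthetic yearly cycle between 0 and 20 degrees with no long-term trend."""
    return 10 + 10 * math.cos(2 * math.pi * (day - phase_days) / 365)


days = list(range(548))  # roughly one and a half years

# Window that starts at a summer peak and ends at a winter trough
peak_to_trough = [seasonal_temperature(d) for d in days]
# Same cycle shifted by half a year: starts at a trough and ends at a peak
trough_to_peak = [seasonal_temperature(d, phase_days=182.5) for d in days]

slope_down = ols_slope(days, peak_to_trough)
slope_up = ols_slope(days, trough_to_peak)
print(f"peak-to-trough slope: {slope_down:.4f} degrees/day")
print(f"trough-to-peak slope: {slope_up:.4f} degrees/day")
```

The two fitted slopes have opposite signs even though the underlying cycle is identical, which is why a straight line through about 1000 days of seasonal data is weak evidence of warming or cooling.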

Question 5.2

rubric={accuracy:2,viz:1}

  • Recreate this plot using Altair or ggplot. Your plot should look as close as possible to mine in the image above. HINT: I used the **ggthemes** package to create the ggplot visualization.
In [355]:
# Your solution here
df1 = pd.read_csv('weatherstats_kelowna_daily.csv', parse_dates=['date'])
df1['city'] = 'Kelowna'

df2 = pd.read_csv('weatherstats_vancouver_daily.csv', parse_dates=['date'])
df2['city'] = 'Vancouver'

df = pd.concat([df1, df2], ignore_index=True)
df

p = alt.Chart(df).mark_point(size=12).encode(
    x=alt.X('date'),
    y=alt.Y('avg_temperature'),
    color=alt.Color('city', scale=alt.Scale(domain=['Kelowna', 'Vancouver'], range=['purple', 'lightskyblue'])),
    opacity = alt.Opacity('city', scale=alt.Scale(domain=['Kelowna', 'Vancouver'], range=[0.3, 0.8]))
).properties(
    title=alt.TitleParams(
        text='BC Temperatures on the decline', fontSize=15),
    width=600,
    height=350
    )

trendlines = alt.Chart(df).transform_regression(
    'date', 'avg_temperature', groupby=['city'], method='linear'
).mark_line(strokeWidth=3).encode(
    x=alt.X('date'),
    y=alt.Y('avg_temperature'),
    color=alt.Color('city', scale=alt.Scale(domain=['Kelowna', 'Vancouver'], range=['purple', 'lightskyblue']))
)

(p + trendlines).configure_axis(
    labelOpacity = 0.5,
    titleOpacity=0.5,
    gridOpacity=0.3
).configure_axisX(
    domainOpacity=0.1
).configure_axisY(
    domainOpacity=0.1
)
Out[355]:

Question 5.3

rubric={viz:1, accuracy:1}

  • Play around with the different trendline options and present one option that you feel captures the trend of the underlying data best. Design the plot to be an effective plot for communication purposes, this includes proper figure size, axis/figure titles, font sizes, colors, etc.

In [20]:
# Your solution here

df1 = pd.read_csv('weatherstats_kelowna_daily.csv', parse_dates=['date'])
df1['city'] = 'Kelowna'

df2 = pd.read_csv('weatherstats_vancouver_daily.csv', parse_dates=['date'])
df2['city'] = 'Vancouver'

df = pd.concat([df1, df2], ignore_index=True)

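# Calculate year-over-year differences (same city and calendar day, consecutive years).
# Sort by date first so each difference is the later year minus the earlier year.
df = df.sort_values('date')
df['temperature_difference'] = df.groupby(['city', df['date'].dt.month, df['date'].dt.day])['avg_temperature'].diff()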

# Exclude values without year-over-year data
filtered_df = df.dropna(subset=['temperature_difference'])


histogram = alt.Chart(filtered_df).mark_bar().encode(
    x=alt.X('temperature_difference:Q', bin=alt.Bin(step=1), title='Temperature Difference'),
    y=alt.Y('count()', title='Count'),
    color=alt.Color('city:N', scale=alt.Scale(domain=['Kelowna', 'Vancouver'], range=['purple', 'lightskyblue']), legend=None),
).properties(
    width=400,
    height=350
).facet(
    column=alt.Column('city:N', title=None),  # Remove default facet title
    title=alt.TitleParams(text='Kelowna and Vancouver Temperatures Remain Unchanged', fontSize=15, anchor='middle', align='center')
)


# Show the histogram
histogram
Out[20]:
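In [19]:
# Year-over-year differences compare the same city and calendar day in consecutive years;
# grouping by ['city', 'year'] would give day-to-day differences within a year instead
df['year'] = df['date'].dt.year
df = df.sort_values('date')
df['temperature_difference'] = df.groupby(['city', df['date'].dt.month, df['date'].dt.day])['avg_temperature'].diff()

filtered_df = df.dropna(subset=['temperature_difference'])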

histogram_by_year = alt.Chart(filtered_df).mark_bar().encode(
    x=alt.X('temperature_difference:Q', bin=alt.Bin(step=1), title='Temperature Difference'),
    y=alt.Y('count()', title='Count'),
    color=alt.Color('city:N', scale=alt.Scale(domain=['Kelowna', 'Vancouver'], range=['purple', 'lightskyblue'])),
).properties(
    width=400,
    height=350
).facet(
    row=alt.Row('year:N', title=None),
    column=alt.Column('city:N', title=None),
    title=alt.TitleParams(
        text='Year-over-Year Temperature Differences',
        anchor='middle', align='center', fontSize=14
    )
)

histogram_by_year
Out[19]:

My plot of Uncertainty

To get a sense of the variations in temperature between Kelowna and Vancouver, I have created the following plot:

Altair

ggplot

[Image: ggplot version of the plot]

Question 6.1

rubric={reasoning:1}

  • Explain which mistakes (if any) I have made in my plot and provide an explanation as to why you think so.

YOUR ANSWER HERE

To get a sense of variation, it is better to use a box plot than a single line, because a box plot shows where the quartiles lie and whether there are any outliers. From the lines in the plot above, I can see where the minimum, maximum, and mean are, but I cannot tell whether the minimum and maximum are outliers or fall within some quantile range.

In the improved plot, which uses box plots, I can now see that the minimum of about -25 is an outlier and that a more reasonable minimum for Kelowna is around -21. Similar observations can be made for the maximum in both Kelowna and Vancouver.

There should be a descriptive title, the axis labels should be cleaned up, and the x-axis should have a unit for temperature. The y-axis does not need the title "city" when Kelowna and Vancouver are already labeled.
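The extra information a box plot carries can be sketched with the standard library. The example below computes quartiles and flags outliers with the Tukey rule of 1.5 times the interquartile range, which, as far as I know, is also the default whisker extent of `mark_boxplot`. The temperatures are made up, and the quantile method may differ from the one the plotting library uses, so exact fences on the real data can differ.

```python
import statistics


def tukey_summary(values):
    """Quartiles, whisker ends, and outliers using the 1.5 * IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),  # most extreme non-outlier values
        "outliers": outliers,
    }


# Hypothetical daily average temperatures for one city
temps = [-25, -5, 0, 3, 6, 9, 12, 15, 18, 21, 24]

box_summary = tukey_summary(temps)
print(box_summary)
```

A line from minimum to maximum would run from -25 to 24 for these values, while the summary separates the single extreme day from the range that the rest of the data occupies.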

Question 6.2

rubric={accuracy:2,viz:1}

  • Recreate this plot using Altair or ggplot. Your plot should look as close as possible to mine in the image above.

In [375]:
# YOUR SOLUTION HERE
line = alt.Chart(df).mark_line().encode(
    x='avg_temperature',
    y='city',
    color=alt.Color('city', scale=alt.Scale(domain=['Kelowna', 'Vancouver'], range=['black', 'black']), legend=None)
) 

means = df.groupby('city')['avg_temperature'].mean().reset_index()

points = alt.Chart(means).mark_circle(size=50, color='black').encode(
    x='avg_temperature',
    y='city'
)
(line + points).configure_axis(
    labelOpacity = 0.6,
    titleOpacity=0.6,
    gridOpacity=0.6
).configure_axisX(
    domainOpacity=0.2
).configure_axisY(
    domainOpacity=0.2
)
Out[375]:

Question 6.3

rubric={viz:2}

  • Create a new and improved visualization to aid in understanding this data. This plot should combat the issues you described in 6.1 and attempt to capture the variability in average temperatures in both cities across seasons. Design this visualization to be an effective plot for communication purposes, this includes proper figure size, axis/figure titles, font sizes, colors, etc. Note that there is no "right" answer here.
In [27]:
# YOUR SOLUTION HERE
box = alt.Chart(df).mark_boxplot().encode(
    x=alt.X('avg_temperature', title='Average Temperature (Celsius)'),
    y=alt.Y('city', title=None),
    color=alt.Color('city', legend=None)
).properties(
    title=alt.Title('Range of temperatures in Kelowna exceeds that of Vancouver', anchor='middle'), width=600, height=250)
box
Out[27]:

Submission to Canvas

When you are ready to submit your assignment do the following:

  1. Run all cells in your notebook to make sure there are no errors by doing Kernel -> Restart Kernel and Run All Cells...
  2. Convert your notebook to .html format using the convert_notebook() function below or by File -> Export Notebook As... -> Export Notebook to HTML
  3. Submit your source file(s) and rendered HTML document to Canvas before the deadline.